NSF PAR Search | NSF Public Access Repository

Toward Carbon-Aware Data Transfers

https://doi.org/10.1109/MIC.2025.3565187

Goldverg, Jacob; Jamil, Hasibul; Rodrigues, Elvis; Kosar, Tevfik (March 2025, IEEE Internet Computing)

The growing adoption of cloud, edge, and distributed computing, together with the rise of AI/ML workloads, has created a significant need to measure, monitor, and reduce the carbon emissions associated with these resource-intensive tasks. One significant source of emissions is data transfers over wide-area networks (WANs), which are often overlooked, primarily because the carbon footprint of end-to-end network paths is difficult to measure accurately. We introduce a novel mechanism to measure network carbon footprints and propose strategies for optimizing the scheduling of network-intensive tasks. We show that users can achieve significant carbon savings by shifting data transfer tasks across time and geographic regions based on local carbon intensity.
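
The time-shifting idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' mechanism: the hourly forecast values, the constant power draw, and the function names `best_start_hour` and `transfer_emissions_g` are all hypothetical, and the scheduler simply picks the contiguous window with the lowest total carbon intensity.

```python
from typing import Sequence


def best_start_hour(intensity: Sequence[float], duration_hours: int) -> int:
    """Return the start index of the contiguous window with the lowest
    total carbon intensity (sliding-window sum, earliest window wins ties)."""
    if duration_hours <= 0 or duration_hours > len(intensity):
        raise ValueError("duration must fit inside the forecast horizon")
    window = sum(intensity[:duration_hours])
    best_sum, best_start = window, 0
    for start in range(1, len(intensity) - duration_hours + 1):
        # Slide the window one hour: add the new hour, drop the oldest one.
        window += intensity[start + duration_hours - 1] - intensity[start - 1]
        if window < best_sum:
            best_sum, best_start = window, start
    return best_start


def transfer_emissions_g(intensity: Sequence[float], start: int,
                         duration_hours: int, power_kw: float) -> float:
    """Estimate emissions in grams of CO2, assuming a constant power draw
    (kW) for the whole transfer and intensity given in gCO2/kWh per hour."""
    return sum(intensity[start:start + duration_hours]) * power_kw


# Hypothetical hourly carbon-intensity forecast in gCO2/kWh (made-up values).
forecast = [420, 390, 310, 250, 240, 300, 410, 450]

start = best_start_hour(forecast, duration_hours=2)
now = transfer_emissions_g(forecast, 0, 2, power_kw=0.5)
shifted = transfer_emissions_g(forecast, start, 2, power_kw=0.5)
print(f"start at hour {start}: {shifted} g instead of {now} g")
```

With these made-up numbers, delaying the two-hour transfer to hour 3 lowers the estimate from 405 g to 245 g. Shifting across geographic regions could be sketched the same way by running the search over one forecast per region.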
Full Text Available
FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters

https://doi.org/10.1109/ICC52391.2025.11160830

Jamil, Hasibul; Alim, Abdul; Schares, Laurent; Maniotis, Pavlos; Schour, Liran; Sydney, Ali; Kayi, Abdullah; Kosar, Tevfik; Karacali, Bengi (June 2025, IEEE International Conference on Communications (ICC 2025))

The increasing complexity of AI workloads, especially distributed Large Language Model (LLM) training, places significant strain on the networking infrastructure of parallel data centers and supercomputing systems. While Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths, hash collisions often lead to imbalanced network resource utilization and performance bottlenecks. This paper presents FlowTracer, a tool designed to analyze network path utilization and evaluate different routing strategies. Unlike tools that introduce additional traffic, FlowTracer aids in debugging network inefficiencies by passively monitoring and correlating user workload flows. As a result, FlowTracer does not interfere with ongoing data transfers, enabling analysis with minimal overhead, which is an important factor when debugging and fine-tuning routing schemes in production systems. FlowTracer can provide detailed insights into traffic distribution and can help identify the root causes of performance degradation, such as hash collisions. With FlowTracer’s flow-level insights, system operators can optimize routing, reduce congestion, and improve the performance of distributed AI workloads. We use a RoCEv2-enabled cluster with a leaf-spine network and sixteen 400-Gbps nodes to demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP routing against a statically configured network. The example showcases a 30% reduction in imbalance, as measured by a new metric we introduce.
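
The hash-collision effect described above can be shown with a small simulation. This is only a sketch under stated assumptions: SHA-256 stands in for a switch's hardware hash, the flow tuples and path count are invented, and the `imbalance` function is a simple illustrative measure, not the metric introduced in the paper and not part of FlowTracer.

```python
import hashlib
from collections import Counter


def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Pick a path by hashing the flow's 5-tuple (SHA-256 is a stand-in
    for the switch hash; real hardware typically uses simpler functions)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def path_loads(assignments: list, num_paths: int) -> list:
    """Count how many flows landed on each path, including unused paths."""
    counts = Counter(assignments)
    return [counts.get(path, 0) for path in range(num_paths)]


def imbalance(loads: list) -> float:
    """Illustrative measure: how far the busiest path exceeds the mean load,
    relative to the mean (0.0 means perfectly even)."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean if mean else 0.0


NUM_PATHS = 8  # hypothetical number of equal-cost paths

# Hypothetical RoCEv2-style flows: (src IP, dst IP, src port, dst port, proto).
flows = [
    (f"10.0.0.{i % 16 + 1}", f"10.0.1.{(i * 7) % 16 + 1}", 49152 + i, 4791, "udp")
    for i in range(32)
]

# Hash-based placement: several flows may collide on the same path.
ecmp_loads = path_loads([ecmp_path(f, NUM_PATHS) for f in flows], NUM_PATHS)
# Static round-robin placement: flows are spread evenly by construction.
static_loads = path_loads([i % NUM_PATHS for i in range(len(flows))], NUM_PATHS)

print("ECMP loads:  ", ecmp_loads, "imbalance:", round(imbalance(ecmp_loads), 2))
print("Static loads:", static_loads, "imbalance:", imbalance(static_loads))
```

Counting flows per path ignores flow sizes; weighting each flow by its byte count would be a natural refinement for elephant-flow traffic such as collective operations in LLM training.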
Full Text Available
Learning to Maximize Network Bandwidth Utilization with Deep Reinforcement Learning

https://doi.org/10.1109/GLOBECOM54140.2023.10437507

Jamil, Hasibul; Rodrigues, Elvis; Goldverg, Jacob; Kosar, Tevfik (December 2023, IEEE)

Efficiently transferring data over long-distance, high-speed networks requires optimal utilization of available network bandwidth. One effective method to achieve this is through the use of parallel TCP streams. This approach allows applications to leverage network parallelism, thereby enhancing transfer throughput. However, determining the ideal number of parallel TCP streams can be challenging due to non-deterministic background traffic sharing the network, as well as non-stationary and partially observable network signals. We present a novel learning-based approach that utilizes deep reinforcement learning (DRL) to determine the optimal number of parallel TCP streams. Our DRL-based algorithm is designed to intelligently utilize available network bandwidth while adapting to different network conditions. Unlike rule-based heuristics, which generalize poorly to unknown network scenarios, our DRL-based solution can dynamically adjust the number of parallel TCP streams to optimize network bandwidth utilization without causing network congestion, while ensuring fairness among competing transfers. We conducted extensive experiments to evaluate our DRL-based algorithm’s performance and compared it with several state-of-the-art online optimization algorithms. The results demonstrate that our algorithm can identify nearly optimal solutions 40% faster while achieving up to 15% higher throughput. Furthermore, we show that our solution can prevent network congestion and distribute the available network resources fairly among competing transfers, unlike a discriminatory algorithm.
Full Text Available
Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming

https://doi.org/10.1145/3624062.3624593

Jamil, Hasibul; Chung, Joaquin; Bicer, Tekin; Kosar, Tevfik; Kettimuthu, Rajkumar (November 2023, ACM)

Full Text Available
Energy-Efficient Data Transfer Optimization via Decision-Tree Based Uncertainty Reduction

https://doi.org/10.1109/ICCCN54977.2022.9868866

Jamil, Hasibul; Rodolph, Lavone; Goldverg, Jacob; Kosar, Tevfik (July 2022, 2022 International Conference on Computer Communications and Networks (ICCCN))

The rapid growth of data produced by scientific instruments, the Internet of Things (IoT), and social media has drawn considerable attention in the research community to data transfer performance and resource consumption. The network infrastructure and end systems that enable this extensive data movement use a substantial amount of electricity, measured in terawatt-hours per year. Managing energy consumption within the core networking infrastructure is an active research area, but there is limited work on reducing power consumption at the end systems during active data transfers. This paper presents a novel two-phase dynamic throughput and energy optimization model. The model uses an offline decision-search-tree based clustering technique to encapsulate and categorize historical data transfer log information, and an online search optimization algorithm to find the application and kernel layer parameter combination that maximizes the achieved data transfer throughput while minimizing energy consumption. Our model also incorporates an ensemble method to reduce aleatoric uncertainty in finding optimal application and kernel layer parameters during the offline analysis phase. The experimental evaluation results show that our decision-tree based model outperforms the state-of-the-art solutions in this area, achieving 117% higher throughput on average while consuming 19% less energy at the end systems during active data transfers.
Full Text Available
